Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins
Authors
Abstract
Accurate annotation of protein functions is important for a profound understanding of molecular biology. A large number of proteins remain uncharacterized because of the sparsity of available supporting information. For a large set of uncharacterized proteins, the only type of information available is their amino acid sequence. In this paper, we propose DeepSeq, a deep learning architecture that uses only protein sequence information to predict a protein's associated functions. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Results of our experiments with DeepSeq indicate significant improvements in prediction accuracy when compared with other sequence-based methods. Our deep learning model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem, namely protein function localization, with no human intervention. Finally, we discuss how this same architecture can be used to solve even more complicated problems, such as prediction of 2D and 3D structure as well as protein-protein interactions.

Introduction

A biological cell is an intricately arranged chemical factory, the sophisticated organization of which results in multi-cellular species of staggering complexity. At the molecular level, proteins are the main workhorses of these living cells. Knowledge of protein functions is of paramount importance for an adequate understanding of these complex molecular machines that ultimately govern life. Accurate identification of protein function has implications in a wide variety of areas, including discovering new therapeutic interventions, understanding novel diseases and designing their cures, and improving agriculture.
Experimental methods are generally used to study protein functions, but they cannot scale up to the task because of the associated cost and time. On the other hand, high-throughput experiments such as next-generation sequencing are producing a large number of new protein sequences that remain uncharacterized [1]. Figure 1 shows the latest major database statistics, which reflect a rapidly increasing gap between known sequences (GenBank and EMBL curves) and structurally characterized sequences (PDB curve). This tremendous growth of uncharacterized proteins poses a serious challenge at the forefront of molecular biology. To overcome this, computational approaches are generally relied upon for the annotation of protein functions. Many techniques have been proposed in the recent past that exploit a wide variety of data to predict protein functions [2–7].

[Figure 1 plot: Number of Entries in Database (logarithmic axis, 10^2 to 10^8) against Year (1980 to 2016); series: GenBank, EMBL, KEGG, PDB.]

Figure 1. Growth of major sequence, pathway and 3D structure databases [1]. The amount of available sequence information (GenBank, EMBL) is orders of magnitude greater than that associated with structure information (PDB).

bioRxiv preprint first posted online Jul. 25, 2017; doi: http://dx.doi.org/10.1101/168120. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Related articles
Folding membrane proteins by deep transfer learning
Computational elucidation of membrane protein (MP) structures is challenging partially due to lack of sufficient solved structures for homology modeling. Here, we describe a high-throughput deep transfer learning method that first predicts MP contacts by learning from non-MPs and then predicts 3D structure models using the predicted contacts as distance restraints. Tested on 510 non-redundant M...
Detection of children's activities in smart home based on deep learning approach
Monitoring the behavior of children in the home is extremely important to avoid possible injuries. Therefore, researchers have considered automated systems for monitoring children's behavior. The first step in designing and executing an automated monitoring system for children's behavior in closed spaces is to recognize their activity by the sensors in the e...
Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques
MOTIVATION: With the increase in submission of sequences to public databases, the curators of these are not able to cope with the amount of information. The motivation of this work is to generate a system for automated annotation of data we are particularly interested in, namely proteins related to the Mycoplasmataceae family. Following previous works on automatic annotation using symbolic machi...
ESG: extended similarity group method for automated protein function prediction
MOTIVATION: Importance of accurate automatic protein function prediction is ever increasing in the face of a large number of newly sequenced genomes and proteomics data that are awaiting biological interpretation. Conventional methods have focused on high sequence similarity-based annotation transfer which relies on the concept of homology. However, many cases have been reported that simple tran...